College of Information Science, University of Arizona
Python libraries
Abstract
Music recommendation systems increasingly rely on machine learning to capture the complexity of user preferences, yet existing models struggle to account for language diversity and nuanced audio features in songs. This project applies signal processing, vocal separation (DEMUCS library), and machine learning techniques to develop a framework for classifying both music genres and song languages, integrating these predictions with genre metadata for improved personalization. By combining automated data collection with advanced audio analysis, the system provides a foundation for smarter, more inclusive recommendation platforms that enhance user experience across diverse musical contexts. The project focused on two tasks: language and genre recognition. For language recognition, classical models—including Logistic Regression, Random Forests, and SVMs—were trained on extracted statistical and time-frequency features using 5-fold cross-validation. These models showed modest predictive performance, with accuracy, precision, recall, and F1-scores generally ranging from 10–60%, while vocal features provided stronger signals than instrumental components. Next, KNNs and Random Forests were applied with ‘genre’ as the target variable. Finally, CNNs were applied to Mel spectrogram images—both grayscale and color scale—with train/validation/test splits, early stopping, and hyperparameter sweeps to capture complex audio patterns. While all models had limited performance, CNNs have strong theoretical potential, as reported in the literature, and improved recognition compared to classical models, highlighting the promise of deep learning and feature engineering for future music recommendation and language identification systems.
Introduction
Music genre classification is a central task in the field of music information retrieval, combining elements of signal processing, machine learning, and deep learning. Accurate genre identification not only enhances music recommendation systems and streaming platforms but also deepens our understanding of audio structure and human perception of sound. Traditional approaches have relied on handcrafted audio features analyzed with machine learning techniques such as Random Forests and Gaussian Mixture Models, offering interpretable yet limited performance [1]. Recent advances, however, leverage deep learning methods—particularly convolutional neural networks (CNNs)—to extract high-level representations directly from spectrograms, achieving state-of-the-art results [2]. This project explores both paradigms: first applying classical machine learning with 5-fold cross-validation, and then advancing to CNN-based classification on spectrogram heat maps, with results evaluated using standard metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.
A note on spectrographic features of .mp3 vs. .wav
Initially, it was assumed that the typical size difference between .mp3 and .wav files would reflect meaningful differences in their spectral properties. However, analysis revealed that the differences were minimal, leading us to abandon the idea of using the two file types as comparative baselines for our models. As shown in Figure 1, the typical similarity between .wav and .mp3 files exceeds 98%, rendering the expectation of differing training results between the two data types a moot point.
Figure 1: Frequency histogram comparison between MP3 and WAV audio files.
A note on project software
It’s generally easier and faster to run Python scripts separately and use their results in the discussion. All scripts used are stored in: Herling-Mi_extra\0_mL_scripts\0_p1 Herling-Mi_extra\0_mL_scripts\0_p2
Questions
1. Language Recognition with Separated Vocal & Audio Tracks
Initial Problem Formulation
How can we leverage statistical and time-frequency features extracted from separated vocal and audio tracks to build effective language recognition models? Specifically, how can traditional machine learning methods — ranging from classical classifiers on simple statistical summaries to Gaussian Mixture Models on richer time-frequency features — be applied in this context?
What are the key benefits and limitations of these approaches?
How can careful feature engineering, feature integration, and thorough model evaluation improve the accuracy and robustness of language recognition systems?
How do model results compare and contrast when using .wav files versus .mp3 files?
Secondary Problem Formulation
From the initial formulation, we refined the question to specifically compare how different ablations of the audio track (complete song, vocal-only, and non-vocal) affect model performance.
How does model performance differ when predicting song language using features from complete songs, vocal-only tracks, and instrumental-only tracks?
What are the relative strengths and limitations of classical machine learning models (Logistic Regression, Random Forest, SVM) when applied to language recognition?
2. Recommendation Systems Using Audio Features & User Data
Initial Problem Formulation
How can user interaction data, combined with basic track metadata and simple audio features, be used to build an effective recommendation system using collaborative filtering and traditional machine learning methods?
Furthermore, how can advanced audio features, dimensionality reduction, and clustering techniques improve personalized recommendations by better capturing user preferences and track characteristics from both vocal and non-vocal components?
How do recommendation model results compare and contrast when using .wav files versus .mp3 files, considering the potential impact of audio quality and compression artifacts on feature extraction and recommendation performance?
Secondary Problem Formulation
We abandoned the use of .wav versus .mp3 formats for the reasons previously mentioned. Instead, the idea of using heat maps/spectrograms was discovered and pursued. A CNN was built for both grayscale and viridis-scale inputs. As will be discussed in the Problem Analysis and Results section, training outcomes were poor despite what initially seemed to be a solid approach. The current hypothesis is that the dataset is too small to produce a robust model, that the extracted song metrics are insufficient to support effective training, or a combination of both. The Likert scale—with options of ‘Likert 2,’ ‘Likert 3,’ and ‘Likert 5’—was not employed, as it represents a second phase of recommendation by genre, contingent on reliable genre recognition.
Dataset
data provenance
The data collection process involved several custom Python scripts designed to scrape and download the necessary information and audio files:
artist_5_song_list_scrape.py — Retrieves the top five songs per artist from Google search results.
artist_genre_scrape.py — Gathers genre metadata for each artist from public sources.
artist_country_of_origin_scrape.py — Extracts the country of origin for each artist.
audio_scrape_wav_mp3.py — Downloads audio files from YouTube links in WAV and MP3 formats.
Together, these scripts automate the extraction of both audio data and relevant metadata to support training and evaluation of the recommendation system.
For question 1, a total of 123 songs were scraped; each was turned into a triplicate of (1) complete song, (2) audio (instrumental) only, and (3) vocal only, yielding 369 observations to work with. The complete extraction pipeline for question 1 took around 15 hours.
For question 2, a total of 20 genres were examined, each with 10 example songs from relevant artists, yielding a set of 200 observations to work with. The complete extraction pipeline for question 2 took around 5 hours, thanks to the use of parallel threads in a reconfigured software file.
software distribution
Initially, the plan was to distribute a software package to both partners so they could each collect their song files and extract the data locally. However, due to the ambitious goals and the multifaceted software requirements needed to accomplish them, the team soon felt as if we were flying the plane while building it. Ultimately, one team member (Nathan) took responsibility for collecting the song file data and generating the features. These features were then distributed to other members, replacing the originally envisioned feature-extraction software suite.
data features
In addition to the artist and song name, the features listed in Table 1 were scraped for each track. The final selection of features was guided as much by curiosity—‘I wonder what this will do’—as by deliberate planning. Research was done, but until you try to build the model yourself, you are not aware of what works under what conditions.
🔍 Feature Scraping
| Feature | Description |
|---|---|
| fundamental_freq | Fundamental frequency (mean pitch via librosa.pyin) |
| freq_e_1 | Dominant spectral energy #1 (highest energy frequency bin) |
| freq_e_2 | Dominant spectral energy #2 (2nd highest energy frequency bin) |
| freq_e_3 | Dominant spectral energy #3 (3rd highest energy frequency bin) |
| key | Estimated musical key (C, C#, D, …, B) via chroma features |
| duration | Length of audio in seconds |
| zero_crossing_rate | Average zero crossing rate (signal sign changes) |
| mfcc_mean | Mean of 13 MFCC coefficients (timbre features) |
| mfcc_std | Standard deviation of MFCC coefficients |
| tempo | Estimated tempo in beats per minute (BPM) |
| rms_energy | Root mean square energy (loudness measure) |
| track_type | Audio track type (0=full mix, 1=vocal only, 2=no vocals) |
| mel_spectrogram | Mel-scaled spectrogram representing frequency content over time (human hearing range) |
Table 1: Extracted Audio Features
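A few of the Table 1 features can be approximated with plain NumPy, shown here on a synthetic tone. The project itself extracted these with librosa; the function and dictionary names below are illustrative stand-ins, and the fundamental-frequency estimate is a simple spectral-peak proxy rather than librosa's pyin.

```python
import numpy as np

def extract_basic_features(y, sr):
    """Compute simplified versions of a few Table 1 features with plain NumPy."""
    # zero_crossing_rate: fraction of consecutive samples whose sign changes
    zcr = np.mean(np.abs(np.diff(np.sign(y))) > 0)
    # rms_energy: root mean square of the waveform (loudness proxy)
    rms = np.sqrt(np.mean(y ** 2))
    # fundamental_freq proxy: highest-energy bin of the magnitude spectrum
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    fundamental = freqs[np.argmax(spectrum)]
    # duration in seconds
    duration = len(y) / sr
    return {"zero_crossing_rate": float(zcr), "rms_energy": float(rms),
            "fundamental_freq": float(fundamental), "duration": duration}

# Example: a 2-second 440 Hz sine wave
sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
features = extract_basic_features(np.sin(2 * np.pi * 440 * t), sr)
print(features)  # fundamental_freq ≈ 440.0, duration = 2.0
```

For real tracks, `y` would come from decoding the audio file; the statistics themselves are computed the same way.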
data storage
This JSON schema, as proposed in the original plan, was refactored as needed. The multiple pipeline components required to gather and merge the information made it more expedient to use a combination of the .json design and .csv files. Shown below is the main .json design used for the project.
After the initial .wav data collection, the following '2nd tier' data were generated/collected:
.mp3
Separated vocal track
Separated audio track
Spectrogram data (viridis scale)
Spectrogram data (grey scale)
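The spectrogram-image generation step can be sketched as follows. The project used Mel spectrograms via librosa; this simplified version uses a plain NumPy STFT on a synthetic signal, and the output filenames are hypothetical. The point is the export of the same matrix under both the viridis and grayscale color maps.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for batch image generation
import matplotlib.pyplot as plt

def spectrogram_db(y, n_fft=512, hop=256):
    """Magnitude spectrogram in dB via a simple NumPy STFT (Hann window)."""
    window = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * window
              for i in range(0, len(y) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)
    return 20 * np.log10(mag + 1e-10)

# Synthetic two-tone signal standing in for a decoded song
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1760 * t)
spec = spectrogram_db(y)

# Save both color-scale variants used in the project
plt.imsave("spec_viridis.png", spec, cmap="viridis", origin="lower")
plt.imsave("spec_gray.png", spec, cmap="gray", origin="lower")
```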
The Demucs library turned out to be easy to use and very good at vocal and background track separation. Scripts that run Demucs using system commands—typically through Python’s subprocess or os libraries—offer a straightforward way to integrate audio separation tools into Python workflows while interacting with the operating system’s file structure and command-line utilities. In order to process files in parallel, a main_script.py capable of generating multiple threads would call a worker_script.py such as the one below.
Below are representative examples of spectrogram feature extraction.
Image 2: Billie Eilish - "bad guy" - spectrogram (viridis scale)
```python
import os
import subprocess

# Path to your input audio file ('~' is expanded to the user's home directory)
audio_file = os.path.expanduser(r"~\Gloria_Gaynor_I_Will_Survive.wav")

# Optional: check if the file exists
if not os.path.exists(audio_file):
    raise FileNotFoundError(f"Audio file not found: {audio_file}")

# Build the Demucs command
# You can change --two-stems to 'drums' or 'bass' if needed
command = [
    "demucs",
    "--two-stems=vocals",      # extract vocals only
    "--out", "demucs_output",  # output folder
    audio_file,
]

# Run the command and fail loudly if Demucs exits with an error
print("🔄 Running Demucs...")
subprocess.run(command, check=True)
print("✅ Separation complete. Check the 'demucs_output' folder for results.")
```
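The companion main_script.py mentioned above can be sketched with a thread pool that launches one worker process per file. The `songs/` folder and the worker's command-line interface are assumptions for illustration, not the project's exact layout; threads suffice here because each thread merely blocks on a subprocess while the real work runs in parallel OS processes.

```python
# main_script.py (sketch): fan worker_script.py out over many audio files.
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_worker(audio_path: Path) -> int:
    """Launch one worker process for a single audio file; return its exit code."""
    result = subprocess.run(
        [sys.executable, "worker_script.py", str(audio_path)],
        capture_output=True, text=True,
    )
    return result.returncode

if __name__ == "__main__":
    audio_files = sorted(Path("songs").glob("*.wav"))  # hypothetical input folder
    with ThreadPoolExecutor(max_workers=4) as pool:
        codes = list(pool.map(run_worker, audio_files))
    failed = sum(1 for c in codes if c != 0)
    print(f"{len(codes) - failed}/{len(codes)} files processed successfully")
```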
Team member workload
Our project workload followed a structured week-by-week workflow as proposed in the initial proposal, with responsibilities distributed among team members. We began by finalizing and sharing the proposal, followed by the individual collection and organization of ~200 audio files per person. Nathan Herling led the processing and validation of metadata, while each member focused on building machine learning pipelines and conducting iterative testing. The project concluded with a collaborative effort on final model evaluation, report preparation, and presentation development.
Problem analysis and results
General
The easy and medium paths proposed in the proposal morphed into multiple paths to tackle the problem, some bearing fruit, some not. In Q1, it could be argued that three easy paths were taken to explore which might work better; in Q2, two easy paths and a medium path were explored.
Q1 - Yashi
How can we leverage audio features from separated vocal and instrumental tracks to improve language recognition in music?
Data Collection: The dataset consisted of ~200 audio files, preprocessed into three ablations: complete songs, vocal-only tracks, and instrumental-only tracks. Features included time-domain statistics (mean, variance, skewness, kurtosis).
Data Processing: All features were standardized using global scaling, and the target variable (language) was encoded with LabelEncoder. No major imputation was required, as missingness was minimal.
Model Selection: I evaluated three models: Logistic Regression, Random Forest, and Support Vector Machines with linear kernels. These models were chosen for their balance of interpretability, robustness, and suitability for structured feature data. Training and evaluation were conducted using 5-fold stratified cross-validation to ensure reliable performance comparisons across models.
Validation & Metrics: Evaluation focused on accuracy, precision, recall, and F1-score. Confusion matrices were used to analyze per-class misclassification patterns.
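The evaluation loop described above can be sketched as follows, with synthetic data standing in for the real extracted feature table; the model settings are illustrative rather than the exact configurations used. Scaling is fit inside each fold via a pipeline so no information leaks across splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted audio-feature table (3 "languages")
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM_linear": SVC(kernel="linear"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)  # scale inside each fold
    scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, results[name])
```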
Model Evaluation:
| Ablation | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| complete_song | LogReg | 0.399 | 0.398 | 0.447 | 0.387 |
| complete_song | RandomForest | 0.626 | 0.469 | 0.410 | 0.401 |
| complete_song | SVM_linear | 0.432 | 0.443 | 0.504 | 0.427 |
| vocal_only | LogReg | 0.560 | 0.531 | 0.567 | 0.509 |
| vocal_only | RandomForest | 0.552 | 0.426 | 0.404 | 0.385 |
| vocal_only | SVM_linear | 0.544 | 0.542 | 0.583 | 0.514 |
| no_vocal | LogReg | 0.333 | 0.371 | 0.364 | 0.316 |
| no_vocal | RandomForest | 0.577 | 0.436 | 0.349 | 0.328 |
| no_vocal | SVM_linear | 0.366 | 0.418 | 0.411 | 0.347 |
| Column Min | - | 0.333 | 0.371 | 0.349 | 0.316 |
| Column Max | - | 0.626 | 0.542 | 0.583 | 0.514 |
Results:
Vocal-only tracks: Provided the best classification signal, with SVM achieving ~0.51 macro F1, outperforming Random Forest and Logistic Regression.
Complete songs: Models achieved moderate performance (~0.40 macro F1), reflecting a mixture of useful vocal cues diluted by instrumental content.
No-vocal tracks: Macro F1 dropped to ~0.32–0.35, the weakest of the three ablations, supporting the expectation that language recognition requires vocal content.
Future recommendations: larger datasets and more hyperparameter exploration.
Q2
How can we leverage audio features to construct a machine learning model capable of genre recognition?
Data Collection: Data collection was performed with Python scripts for each feature listed in Table 1. Ten genres were chosen, with twenty representative artists per genre; both genres and artists were selected via Google search.
Data Processing: All data was present, so no imputation was needed. Statistical outliers were not eliminated, since the model design had not yet been explored thoroughly enough to warrant removing data.
Model Selection: Three supervised machine learning methods were chosen: (1) KNN [classic baseline], (2) Random Forest [with the hope of good baseline results], and (3) a CNN, applied to both the numerical dataset and the extracted spectrogram files.
Model Validation:
(1) KNN
LOOCV: Validates and selects the best hyperparameters.
5-Fold CV learning curve: Validates generalization performance as a function of training size.
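The LOOCV-based hyperparameter selection in step (1) can be sketched as follows on synthetic data; the parameter grid is an assumption, not the sweep actually run. With leave-one-out, each fold holds out a single sample, so accuracy is the natural scoring choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the genre feature table (4 classes)
X, y = make_classification(n_samples=120, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)

param_grid = {
    "kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(pipe, param_grid, cv=LeaveOneOut(), scoring="accuracy")
search.fit(X, y)
print("best params:", search.best_params_)
print("LOOCV accuracy:", round(search.best_score_, 3))
```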
KNN Evaluation: This learning curve reveals a significant gap between training and cross-validation performance for the KNN classifier:
🔵 Training Score: The model achieves a perfect F1 score of 1.0 across all training set sizes, which is a strong indicator of overfitting—the model memorizes the training data rather than generalizing from it.
🟢 Cross-Validation Score: Starts near 0.0 and only climbs to about 0.2 even with 160 training samples. This suggests the model struggles to generalize and perform well on unseen data.
📉 Implication: Despite using the best hyperparameters, the model may be too sensitive to noise or lacks sufficient complexity to capture meaningful patterns. KNN’s reliance on local structure might be failing due to sparse or high-dimensional data.
Metrics and Results (Random Forest)
Hyperparameter Sweep - RF
Learning Curve - RF
| Metric | Value |
|---|---|
| max_depth | 10.0 |
| min_samples_leaf | 1.0 |
| min_samples_split | 2.0 |
| n_estimators | 200.0 |
| accuracy | 0.925 |
| precision | 0.927554 |
| recall | 0.925 |
| f1_score | 0.924728 |
Random Forest Analysis: This random forest model appears to be overtrained, and here’s why:
🔍 Key Indicators of Overtraining
Training Accuracy = 1.0 across all training set sizes:
This suggests the model is memorizing the training data perfectly, which is a classic sign of overfitting.
Validation Accuracy starts low (~0.1) and rises to ~0.85:
While the validation accuracy improves with more data, the persistent gap between training and validation accuracy indicates poor generalization early on.
Even at the largest training size, the model still performs significantly worse on unseen data than on training data.
📈 What a Healthy Learning Curve Might Look Like
Training accuracy should decrease slightly as training size increases (less memorization).
Validation accuracy should increase and converge toward training accuracy.
A smaller gap between the two curves suggests better generalization.
🧠 Why Random Forests Can Overfit
If the number of trees is too high or if each tree is allowed to grow too deep, the ensemble can overfit.
Especially with small datasets, random forests can memorize patterns that don’t generalize.
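One mitigation consistent with the sweep’s knobs is to constrain tree growth and then compare the train/CV accuracy gap. The hyperparameter values below are illustrative, and synthetic data stands in for the genre feature table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the genre dataset (small, like the project's ~200 rows)
X, y = make_classification(n_samples=200, n_features=12, n_informative=5,
                           n_classes=4, random_state=0)

candidates = {
    "unconstrained": RandomForestClassifier(n_estimators=200, random_state=0),
    "constrained": RandomForestClassifier(n_estimators=200, max_depth=10,
                                          min_samples_leaf=2, min_samples_split=4,
                                          random_state=0),
}

gaps = {}
for name, model in candidates.items():
    model.fit(X, y)
    train_acc = model.score(X, y)               # accuracy on the training data
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy
    gaps[name] = train_acc - cv_acc
    print(f"{name}: train={train_acc:.3f}  cv={cv_acc:.3f}  gap={gaps[name]:.3f}")
```

A shrinking gap between the two numbers is the signal that the regularization is helping, matching the "healthy learning curve" description above.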
Metrics and Results (CNN - Grey scale)
Hyperparameter Sweep CNN - grey scale
Learning Curve CNN - grey scale
| Conv Layers | Epochs | Patience | Accuracy | F1 | Precision |
|---|---|---|---|---|---|
| 2 | 10 | 2 | 0.0750 | 0.0143 | 0.0079 |
| 2 | 10 | 5 | 0.1000 | 0.0229 | 0.0129 |
| 3 | 10 | 2 | 0.1000 | 0.0182 | 0.0100 |
| 3 | 10 | 5 | 0.1000 | 0.0182 | 0.0100 |
| 2 | 15 | 2 | 0.1000 | 0.0186 | 0.0103 |
| 2 | 15 | 5 | 0.1000 | 0.0186 | 0.0103 |
| 3 | 15 | 2 | 0.1250 | 0.0450 | 0.0361 |
| 3 | 15 | 5 | 0.1000 | 0.0182 | 0.0100 |
| 2 | 30 | 2 | 0.1000 | 0.0182 | 0.0100 |
| 2 | 30 | 5 | 0.2000 | 0.0750 | 0.0476 |
| 3 | 30 | 2 | 0.1000 | 0.0182 | 0.0100 |
| 3 | 30 | 5 | 0.1000 | 0.0182 | 0.0100 |
CNN Model Assessment - Grayscale Data
🚨 Red Flags in the Learning Curve
Training Accuracy rises to 1.0 by epoch 5: The model is perfectly memorizing the training data.
Validation Accuracy stays flat at ~0.2: The model is not generalizing at all. It is essentially guessing on unseen data.
🔍 Possible Causes
Data Issues:
Grayscale input might lack sufficient contrast or features.
Labels could be noisy or mismatched.
Model Complexity: The CNN might be too deep or have too many parameters for the dataset size.
Overtraining:
No regularization (e.g., dropout, weight decay).
No early stopping.
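A minimal sketch of the missing regularization, assuming 128×128 grayscale inputs and 10 genre classes (both assumptions), combines dropout with a Keras early-stopping callback. The architecture is illustrative, not the project's exact CNN; the fit call is commented out since no data is loaded here.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10
model = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),   # grayscale spectrogram image (assumed size)
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),                 # regularization against memorization
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop when validation loss stalls and keep the best weights seen so far
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=30, callbacks=[early_stop])

# Sanity check on a dummy input: one image in, one probability row per class out
preds = model(np.zeros((1, 128, 128, 1), dtype="float32"))
print(preds.shape)
```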
Note: None of the attempts to reduce overtraining succeeded. It is postulated that the dataset needs to be larger for the CNN to learn meaningful patterns.
Metrics and Results (CNN - Color scale)
Hyperparameter Sweep - Color Spectrogram - CNN
Learning Curve - Color Spectrogram - CNN
Final CNN Model Stats
Training Accuracy: 0.1937
Training Loss: 2.1730
Validation Accuracy: 0.1750
Validation Loss: 2.1234
Epoch 5: Early stopping triggered
Restoring model weights from the best epoch: 1
Final Best Model Metrics:
Accuracy: 0.1000
Precision: 0.0100
F1 Score: 0.0182
Color Spectrogram CNN Learning Curve
This graph shows a modestly improving CNN model trained on color spectrogram data, but it’s still underperforming overall.
📈 What the Learning Curve Shows:
Training Accuracy steadily increases from 0.0 to ~0.18 by epoch 4.
Validation Accuracy peaks at epoch 2 (~0.20), then slightly declines and flattens.
🧠 Interpretation:
The model is learning, but very slowly.
The validation peak at epoch 2 suggests the model briefly generalized well, but then started to overfit.
The low overall accuracy (max ~0.20) implies the model is struggling to extract meaningful features from the spectrograms.
🔍 Possible Issues:
Spectrogram preprocessing might be suboptimal (e.g., poor resolution, noisy input).
Model architecture may be too shallow or not well-tuned for this type of data.
Class imbalance or label noise could be limiting performance.
Too few epochs — the model might need more time to converge.
Note: Epoch 2 was the optimal epoch from the hyperparameter sweep. A postulated fix is to use a larger dataset to improve generalization and model performance.
Future steps/recommendations
Results & Conclusion
The primary goal of this project was to develop a machine learning system capable of recognizing both the language spoken in audio files and the musical genre, in order to enhance personalization in AI-driven music recommendation platforms. By accurately identifying spoken language within songs and combining this information with genre metadata, the system aims to suggest tracks that more closely align with individual user preferences. The challenge involved processing raw audio data, separating vocal from instrumental components, extracting meaningful statistical and time-frequency features, and applying both classical machine learning models and deep learning architectures to capture the underlying patterns in music.
To address these goals, the team experimented with a variety of models. Classical approaches—including Logistic Regression, Random Forests, and Support Vector Machines—were trained on extracted audio features using cross-validation, yielding modest predictive performance with accuracy, precision, recall, and F1-scores generally between 10–60%. K-Nearest Neighbors and Random Forests were applied for genre classification, while Convolutional Neural Networks were trained on Mel spectrogram images (grayscale and color), using early stopping and hyperparameter sweeps to optimize performance. Although the models demonstrated only limited overall accuracy, CNNs showed comparatively stronger results and align with literature on deep learning’s potential for audio analysis. Future improvements may include expanding the dataset size, refining spectrogram preprocessing, exploring deeper or more specialized architectures, and integrating more robust feature engineering to enhance both language and genre recognition for more effective recommendation systems.
Video links
Nathan #_to_Do
Audio Player Demo
A demo of the UI/UX audio player written for this project. First, a few songs are scrolled through to demonstrate the ‘real time’ generation of dB vs. frequency curves for .wav and .mp3 files. Next, a song is played to demonstrate the ‘real time’ audio analysis with the spectrogram (heat map) feature.
---title: "Audio Alchemy"subtitle: "INFO 523 - Final Project"author: - name: "Nathan Herling & Yashi Mi " affiliations: - name: "College of Information Science, University of Arizona"description: "Project description"format: html: code-tools: true code-overflow: wrap embed-resources: trueeditor: visualexecute: warning: false echo: falsejupyter: python3---## Python libraries```{python}#| label: load-pkgs#| message: false#| echo: falseimport osimport jsonimport subprocessimport numpy as npimport pandas as pd# === Machine Learning & Evaluation ===import sklearn # Models, preprocessing, cross-validation, metricsimport lightgbm as lgb # Gradient boostingimport xgboost as xgb # Gradient boosting#import surprise # Consider removing if problematic, see alternatives# === Deep Learning Frameworks ===import torch # PyTorch (used with Demucs, CNNs, etc.)import tensorflow as tf # TensorFlowfrom tensorflow import keras # Keras API# === Audio Processing ===import librosa # Feature extraction (ZCR, RMS, tempo, etc.)import torchaudio # Audio I/O and transformations with PyTorchfrom demucs.applyimport apply_modelfrom demucs.pretrained import get_model # Vocal separation# === Visualization ===import matplotlib.pyplot as pltimport seaborn as sns# === Display & Formatting (for .qmd / Jupyter) ===from IPython.display import display, HTML```## AbstractMusic recommendation systems increasingly rely on machine learning to capture the complexity of user preferences, yet existing models struggle to account for language diversity and nuanced audio features in songs. This project applies signal processing, vocal separation (DEMUCS library), and machine learning techniques to develop a framework for classifying both music genres and song languages, integrating these predictions with genre metadata for improved personalization. 
By combining automated data collection with advanced audio analysis, the system provides a foundation for smarter, more inclusive recommendation platforms that enhance user experience across diverse musical contexts. The focus of the project was: langauge and genre recognition. For language recognition, classical models—including Logistic Regression, Random Forests, and SVMs—were trained on extracted statistical and time-frequency features using 5-fold cross-validation. These models showed modest predictive performance, with accuracy, precision, recall, and F1-scores generally ranging from 10–60%, while vocal features provided stronger signals than instrumental components. Next, KNNs and Random Forests were applied with 'genre' as the target variable. Finally, CNNs were applied to Mel spectrogram images—both grayscale and color scale—with train/validation/test splits, early stopping, and hyperparameter sweeps to capture complex audio patterns. While all models had limited performance, CNNs have strong theoretical potential, as reported in the literature, and improved recognition compared to classical models, highlighting the promise of deep learning and feature engineering for future music recommendation and language identification systems.## IntroductionMusic genre classification is a central task in the field of music information retrieval, combining elements of signal processing, machine learning, and deep learning. Accurate genre identification not only enhances music recommendation systems and streaming platforms but also deepens our understanding of audio structure and human perception of sound. Traditional approaches have relied on handcrafted audio features analyzed with machine learning techniques such as Random Forests and Gaussian Mixture Models, offering interpretable yet limited performance \[1\]. 
Recent advances, however, leverage deep learning methods—particularly convolutional neural networks (CNNs)—to extract high-level representations directly from spectrograms, achieving state-of-the-art results \[2\]. This project explores both paradigms: first applying classical machine learning with 5-fold cross-validation, and then advancing to CNN-based classification on spectrogram heat maps, with results evaluated using standard metrics including accuracy, precision, recall, F1-score, confusion matrices, and ROC curves.## A note on spectographic features of <code>.mp3</code> vs. <code>.wav</code>Initially, it was assumed that the typical size difference between .mp3 and .wav files would reflect meaningful differences in their spectral properties. However, analysis revealed that the differences were minimal, leading us to abandon the idea of using the two file types as comparative baselines for our models. As shown in Figure 1, the typical similarity between <code>.wav</code> and <code>.mp3</code> files exceeds 98%, rendering the expectation of differing training results between the two data types a moot point.<div style="margin: 20px 40px; text-align: center;"><img src="images/mp3_v_wav.png" alt="MP3 vs WAV" style="max-width: 80%; border-radius: 8px;"><p style="margin-top: 10px; font-size: 14px; color: #444;">Figure 1: Frequency histogram comparison between MP3 and WAV audio files.</p></div>## A note on project softwareIt's generally easier and faster to run Python scrips separately and use the results in the discussion of the results. All scripts used are stored in:<br><code>**Herling-Mi\_extra\0_mL_scripts\0_p1**</code><br><code>**Herling-Mi\_extra\0_mL_scripts\0_p2**</code><br>## Questions### 1. Language Recognition with Separated Vocal & Audio Tracks#### <b>Initial Problem Formulation</b>How can we leverage **statistical** and **time-frequency features** extracted from separated vocal and audio tracks to build effective language recognition models? 
Specifically, how can traditional machine learning methods — ranging from **classical classifiers** on simple statistical summaries to **Gaussian Mixture Models** on richer time-frequency features — be applied in this context?- What are the key **benefits** and **limitations** of these approaches?\- How can **careful feature engineering**, **feature integration**, and **thorough model evaluation** improve the accuracy and robustness of language recognition systems?\- How do model results compare and contrast when using <code>**.wav**</code> files versus <code>**.mp3**</code> files?#### <b>Secondary Problem formulation</b>From the initial formulation, we refined the question to specifically compare how different ablations of the audio track (complete song, vocal-only, and non-vocal) affect model performance.- How does model performance differ when predicting song language using features from complete songs, vocal-only tracks, and instrumental-only tracks?- What are the relative strengths and limitations of classical machine learning models (Logistic Regression, Random Forest, SVM) when applied to language recognition?### 2. 
Recommendation Systems Using Audio Features & User Data#### <b>Initial Problem Formulation</b>How can **user interaction data**, combined with basic track metadata and simple audio features, be used to build an effective recommendation system using **collaborative filtering** and traditional machine learning methods?- Furthermore, how can **advanced audio features**, **dimensionality reduction**, and **clustering techniques** improve personalized recommendations by better capturing user preferences and track characteristics from both vocal and non-vocal components?\- How do recommendation model results compare and contrast when using <code>**.wav**</code> files versus <code>**.mp3**</code> files, considering the potential impact of audio quality and compression artifacts on feature extraction and recommendation performance?#### <b>Secondary Problem Formulation</b>We abandoned the use of <code>.wav</code> versus <code>.mp3</code> formats for the reasons previously mentioned. Instead, the idea of using heat maps/spectrograms was discovered and pursued. A CNN was built for both grayscale and viridis-scale inputs. As will be discussed in the <b>Problem Analysis and Results</b> section, training outcomes were poor despite what initially seemed to be a solid approach. The current hypothesis is that the dataset is too small to produce a robust model, that the extracted song metrics are insufficient to support effective training, or a combination of both. The Likert scale—with options of 'Likert 2,' 'Likert 3,' and 'Likert 5'—was not employed, as it represents a second phase of recommendation by genre, contingent on reliable genre recognition. 
## Dataset### data provenanceThe data collection process involved several custom Python scripts designed to scrape and download the necessary information and audio files:<code>**artist_5_song_list_scrape.py**</code> — Retrieves the top five songs per artist from Google search results.<code>**artist_genre_scrape.py**</code> — Gathers genre metadata for each artist from public sources.<code>**artist_country_of_origin_scrape.py**</code> — Extracts the country of origin for each artist.<code>**audio_scrape_wav_mp3.py**</code> — Downloads audio files from YouTube links in WAV and MP3 formats.Together, these scripts automate the extraction of both audio data and relevant metadata to support training and evaluation of the recommendation system.<b>For question 1</b><br> A total of 123 songs were scraped, each was turned into the triplicate of (1) complete song, (2) audio only (3) vocal only. Yielding 369 observations to work with. The complete extraction pipeline for question 1 was around 15 hours.<b>For question 2</b><br> A total of 20 genres were examined, each with 10 example songs from relevant artists - yielding a set of 200 observations to work with. The complete extraction pipeline for question 2 was around 5 hours - due to the use of parallel threads in a reconfigured software file.### software distriubtionInitially, the plan was to distribute a software package to both partners so they could each collect their song files and extract the data locally. However, due to the ambitious goals and the multifaceted software requirements needed to accomplish them, the team soon felt as if we were flying the plane while building it. Ultimately, one team member (Nathan) took responsibility for collecting the song file data and generating the features. 
These features were then distributed to other members, replacing the originally envisioned feature-extraction software suite.

### data features

In addition to the artist and song name, the features listed in Table 1 were scraped for each track. The final selection of features was guided as much by curiosity—‘I wonder what this will do’—as by deliberate planning. Research was done, but until you try to build the model yourself, you are not aware of what works under what conditions.

:::: cell
::: {style="width: 110%; margin: 0px -20px 30px -50px; font-family: Arial, sans-serif; font-size: 1.2em; line-height: 1.2; border: 5px solid midnightblue; border-radius: 8px; padding: 8px 12px; background-color: #fdfdfd; overflow-x: auto;"}
<h2 style="color: #007ACC; margin: -10px 0 0px -20px; text-align: center; font-size: 2.0em;">🔍 Feature Scraping</h2>

| Feature | Description |
|-----------------|-------------------------------------------------------|
| fundamental_freq | Fundamental frequency (mean pitch via librosa.pyin) |
| freq_e_1 | Dominant spectral energy #1 (highest energy frequency bin) |
| freq_e_2 | Dominant spectral energy #2 (2nd highest energy frequency bin) |
| freq_e_3 | Dominant spectral energy #3 (3rd highest energy frequency bin) |
| key | Estimated musical key (C, C#, D, ..., B) via chroma features |
| duration | Length of audio in seconds |
| zero_crossing_rate | Average zero crossing rate (signal sign changes) |
| mfcc_mean | Mean of 13 MFCC coefficients (timbre features) |
| mfcc_std | Standard deviation of MFCC coefficients |
| tempo | Estimated tempo in beats per minute (BPM) |
| rms_energy | Root mean square energy (loudness measure) |
| track_type | Audio track type (0=full mix, 1=vocal only, 2=no vocals) |
| mel_spectrogram | Mel-scaled spectrogram representing frequency content over time (human hearing range) |
:::
::::

<p style="text-align:center; font-style:italic; margin-top:-20px;">Table 1: Extracted Audio Features</p>

### data storage

This <code>JSON</code> schema, as
proposed in the original plan, was refactored as needed. The multiple pipeline components required to gather and merge the information made it more expedient to use a combination of the <code>.json</code> design and <code>.csv</code> files. Shown below is the main <code>.json</code> design used for the project.

```{python}
#| label: json-ex
#| echo: false
import json
from IPython.display import display, HTML

# The main JSON node design, expressed as a Python dictionary
node = {
    "AudioFile": {
        "yt_link": [],
        "wav_link": [],
        "mp3_link": []
    },
    "Region": ["America"],
    "Data": {
        "Artist": "Lady Gaga",
        "Song Title": "",
        "Genre": [],
        "Mean (of features)": None,
        "Variance": None,
        "Skewness": None,
        "Kurtosis": None,
        "Zero Crossing Rate": None,
        "RMS Energy": None,
        "Loudness": None,
        "Energy": None,
        "Tempo": None,
        "Danceability": None,
        "Key / Key Name": "",
        "Mode / Mode Name": "",
        "Mel-Spectrogram": None,
        "Duration (ms)": None
    },
}

json_str = json.dumps(node, indent=4)
html_code = f"""
<div style="max-height: 400px; overflow: auto; border: 1px solid #ccc; padding: 10px; background: #f9f9f9; white-space: pre-wrap; font-family: monospace; font-size: 11px;">{json_str}</div>
"""
display(HTML(html_code))
```

## Data - Second Tier Features

```{=html}
<p>After the initial <code>.wav</code> data collection, the following '2nd tier' data were generated/collected:</p>
<ul>
  <li><code>.mp3</code></li>
  <li>Separated vocal track</li>
  <li>Separated audio track</li>
  <li>Spectrogram data (viridis scale)</li>
  <li>Spectrogram data (grey scale)</li>
</ul>
```

The <code>Demucs</code> library turned out to be easy to use and very good at vocal and background track separation. Scripts that run `Demucs` using system commands—typically through Python’s `subprocess` or `os` libraries—offer a straightforward way to integrate audio separation tools into Python workflows while interacting with the operating system’s file structure and command-line utilities.
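One way to drive many such separations at once is a thread pool that launches one subprocess per file; threads are sufficient because each thread simply waits on its child process. The sketch below is illustrative only — the script name and file list are placeholders, not the project's actual driver.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_jobs(commands, max_workers=4):
    """Run each command in its own subprocess, up to max_workers at a time.

    Returns the list of exit codes in the same order as `commands`.
    """
    def run(cmd):
        # capture_output keeps the child processes' stdout/stderr
        # from interleaving on the console
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))

# Hypothetical usage: one worker_script.py invocation per song file.
# codes = run_jobs([["python", "worker_script.py", f] for f in wav_files])
```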
In order to run files in parallel, a <code>main_script.py</code> capable of generating multiple threads would call a <code>worker_script.py</code> such as the one below. First, two representative spectrogram extractions are shown.

```{=html}
<div style="display: flex; justify-content: flex-start; align-items: flex-start; margin-bottom: 20px; margin-left: -110px;">
  <div style="text-align: center; margin-right: 20px;">
    <img src="_extra\0_mL_scripts\0_p2\0_grey_Mel_Train\Queen_We Will Rock You_gray.png" alt="~Q" style="width: 90%; height: auto;">
    <p style="margin-left: 0px; margin-right: 5px;">image 1.<br>Queen - We Will Rock You - spectrogram - grey scale</p>
  </div>
  <div style="text-align: center; margin-right: 50px;">
    <img src="_extra\0_mL_scripts\0_p2\0_grey_Mel_Train\Billie Eilish_bad guy_viridis.png" alt="Billy~" style="width: 90%; height: auto;">
    <p style="margin-left: 0px; margin-right: 5px;">image 2.<br>Billie Eilish - bad guy - spectrogram - viridis scale</p>
  </div>
</div>
```

```{python}
#| label: ex_script_1
#| message: false
#| eval: false
#| echo: true
import os
import subprocess

# Path to the input audio file
audio_file = r"~\Gloria_Gaynor_I_Will_Survive.wav"

# Optional: check that the file exists
if not os.path.exists(audio_file):
    raise FileNotFoundError(f"Audio file not found: {audio_file}")

# Build the Demucs command
# --two-stems can be changed to 'drums' or 'bass' if needed
command = [
    "demucs",
    "--two-stems=vocals",      # Extract vocals only
    "--out", "demucs_output",  # Output folder
    audio_file,
]

# Run the command
print("🔄 Running Demucs...")
subprocess.run(command)
print("✅ Separation complete. Check the 'demucs_output' folder for results.")
```

## Team member workload

Our project workload followed a structured week-by-week workflow as proposed in the initial proposal, with responsibilities distributed among team members. We began by finalizing and sharing the proposal, followed by the individual collection and organization of \~200 audio files per person.
Nathan Herling led the processing and validation of metadata, while each member focused on building machine learning pipelines and conducting iterative testing. The project concluded with a collaborative effort on final model evaluation, report preparation, and presentation development.

## Problem analysis and results

### General

The easy and medium paths proposed in the proposal morphed into multiple paths to tackle the problem, some bearing fruit, some not. In Q1, it could be argued that three easy paths were taken in an attempt to explore which might work better; in Q2, two easy paths and a medium path were explored.

### Q1 - Yashi

<b>**How can we leverage audio features from separated vocal and instrumental tracks to improve language recognition in music?**

**Data Collection:** The dataset consisted of \~200 audio files, preprocessed into three ablations: complete songs, vocal-only tracks, and instrumental-only tracks. Features included time-domain statistics (mean, variance, skewness, kurtosis).

**Data Processing:** All features were standardized using global scaling. The target variable (language) was encoded with LabelEncoder. No major imputation was required, as missingness was minimal.

**Model Selection:** I evaluated three models: Logistic Regression, Random Forest, and Support Vector Machines with linear kernels. These models were chosen for their balance of interpretability, robustness, and suitability for structured feature data. Training and evaluation were conducted using 5-fold stratified cross-validation to ensure reliable performance comparisons across models.

**Validation & Metrics:** Evaluation focused on accuracy, precision, recall, and F1-score.
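The selection-and-validation loop just described can be sketched with scikit-learn as follows; the synthetic data here stands in for the real scaled feature matrix and LabelEncoder-ed language labels, and the exact pipeline details are our assumptions rather than the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in the project X holds the extracted statistical
# features and y the encoded languages.
X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM_linear": SVC(kernel="linear"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

for name, model in models.items():
    # Scaling lives inside the pipeline so each CV fold is fit independently
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
    print(name, round(scores["test_f1_macro"].mean(), 3))
```

Keeping the scaler inside the pipeline prevents information from the held-out fold leaking into the fit, which matters when comparing models this closely.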
Confusion matrices were used to analyze per-class misclassification patterns.

**Model Evaluation:**</b>

| Ablation | Model | Accuracy | Precision | Recall | F1 |
|----------------|--------------|----------|-----------|--------|-------|
| complete_song | LogReg | 0.399 | 0.398 | 0.447 | 0.387 |
| complete_song | RandomForest | 0.626 | 0.469 | 0.410 | 0.401 |
| complete_song | SVM_linear | 0.432 | 0.443 | 0.504 | 0.427 |
| vocal_only | LogReg | 0.560 | 0.531 | 0.567 | 0.509 |
| vocal_only | RandomForest | 0.552 | 0.426 | 0.404 | 0.385 |
| vocal_only | SVM_linear | 0.544 | 0.542 | 0.583 | 0.514 |
| no_vocal | LogReg | 0.333 | 0.371 | 0.364 | 0.316 |
| no_vocal | RandomForest | 0.577 | 0.436 | 0.349 | 0.328 |
| no_vocal | SVM_linear | 0.366 | 0.418 | 0.411 | 0.347 |
| **Column Min** | \- | 0.333 | 0.371 | 0.349 | 0.316 |
| **Column Max** | \- | 0.626 | 0.542 | 0.583 | 0.514 |

```{python}
#| echo: false
#| fig-cap: ""
#| fig-width: 7
#| fig-height: 4
#| out-width: 70%
#| fig-align: center
#| fig-format: svg
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = [
    {"ablation": "complete_song", "model": "LogReg", "f1": 0.387},
    {"ablation": "complete_song", "model": "RandomForest", "f1": 0.401},
    {"ablation": "complete_song", "model": "SVM_linear", "f1": 0.427},
    {"ablation": "vocal_only", "model": "LogReg", "f1": 0.509},
    {"ablation": "vocal_only", "model": "RandomForest", "f1": 0.385},
    {"ablation": "vocal_only", "model": "SVM_linear", "f1": 0.514},
    {"ablation": "no_vocal", "model": "LogReg", "f1": 0.316},
    {"ablation": "no_vocal", "model": "RandomForest", "f1": 0.328},
    {"ablation": "no_vocal", "model": "SVM_linear", "f1": 0.347},
]
df_f1 = pd.DataFrame(data)

# Fix the display order of tracks and models
order_ablation = ["vocal_only", "complete_song", "no_vocal"]
order_model = ["SVM_linear", "LogReg", "RandomForest"]
df_f1["ablation"] = pd.Categorical(df_f1["ablation"], categories=order_ablation, ordered=True)
df_f1["model"] = pd.Categorical(df_f1["model"], categories=order_model, ordered=True)

plt.figure(figsize=(8, 4))
ax = sns.barplot(data=df_f1, x="ablation", y="f1", hue="model")
ax.set_xlabel("Track Type (Ablation)")
ax.set_ylabel("F1 (macro)")
ax.set_title("F1 (macro) by Track Type and Model")
plt.show()
```

**Results:**

- Vocal-only tracks provided the best classification signal, with the linear SVM achieving \~0.51 macro F1, outperforming Random Forest and Logistic Regression.
- Complete songs achieved moderate performance (\~0.40 macro F1), reflecting useful vocal cues diluted by instrumental content.
- Instrumental-only (no_vocal) tracks yielded the weakest F1 scores (\~0.32–0.35), validating the expectation that language recognition requires vocal content.
- Future recommendations: larger datasets and more extensive hyperparameter exploration.

### Q2

How can we leverage audio features to construct a machine learning model capable of genre recognition?

- Data Collection:<br>Data collection was performed with Python scripts for each feature listed in Table 1. Ten genres were chosen, with twenty representative artists for each genre; both were chosen via Google search.
- Data Processing:<br>All data was present, so no imputation was needed.
It was decided not to eliminate statistical outliers, since the model design had not been explored thoroughly enough to warrant selecting out data.

- Model Selection:<br>Three supervised machine learning methods were chosen: (1) KNN [classic baseline], (2) Random Forest [with the hope of good baseline results], and (3) a CNN trained on the numerical dataset and the extracted spectrogram files.
- Model Validation:<br>

```{=html}
<div style="margin-left: 20px;">
  <p><strong>(1) KNN</strong></p>
  <ul>
    <li>LOOCV: Validates and selects the best hyperparameters.</li>
    <li>5-Fold CV learning curve: Validates generalization performance as a function of training size.</li>
  </ul>
  <p><strong>(2) Random Forest</strong></p>
  <ul>
    <li>Cross validation</li>
    <li>Learning curve</li>
  </ul>
  <p><strong>(3) CNN</strong></p>
  <ul>
    <li>Early stopping</li>
  </ul>
  <p><strong>Model Metrics</strong></p>
  <p><strong>(1) KNN - hyperparameter sweep:</strong></p>
  <ul>
    <li>n_neighbors: [1, 3, 5, 7, 9]</li>
    <li>weights: ['uniform', 'distance']</li>
    <li>metric: ['euclidean', 'manhattan']</li>
    <li>p: [1, 2] (only relevant if metric='minkowski')</li>
  </ul>
  <p><strong>(2) Random Forest - hyperparameter sweep:</strong></p>
  <ul>
    <li>n_estimators: [10, 50, 100, 200]</li>
    <li>max_depth: [None, 5, 10, 15]</li>
    <li>min_samples_split: [2, 5, 10]</li>
    <li>min_samples_leaf: [1, 2, 4]</li>
  </ul>
  <p><strong>(3) CNN - hyperparameter sweep:</strong></p>
  <ul>
    <li>conv_filters: [[32, 64], [64, 128], [32, 64, 128]]</li>
    <li>kernel_size: [(3,3), (5,5)]</li>
    <li>dropout: [0.2, 0.3, 0.5]</li>
    <li>learning_rate: [0.01, 0.001, 0.0001]</li>
  </ul>
</div>
```

## Metrics and Results (KNN)

```{=html}
<div style="display: flex; justify-content: flex-start; align-items: flex-start; margin-bottom: 20px; margin-left: -140px;">
  <div style="text-align: center; margin-right: 20px;">
    <img src="_extra/0_mL_scripts/0_p2/0_knn/hyper_param_sweep.png" alt="Hyperparameter Sweep" style="width: 130%; height: auto;">
    <p>Hyperparameter Sweep</p>
  </div>
  <div
style="text-align: center; margin-right: 50px;">
    <img src="_extra/0_mL_scripts/0_p2/0_knn/learning_curve.png" alt="Learning Curve 2" style="width: 130%; height: auto;">
    <p>Learning Curve - KNN</p>
  </div>
</div>
```

```{=html}
<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse; text-align: center;">
  <thead>
    <tr> <th colspan="2">Best Hyperparameters</th> </tr>
  </thead>
  <tbody>
    <tr> <td>knn__algorithm</td> <td>auto</td> </tr>
    <tr> <td>knn__n_neighbors</td> <td>15</td> </tr>
    <tr> <td>knn__p</td> <td>1</td> </tr>
    <tr> <td>knn__weights</td> <td>uniform</td> </tr>
  </tbody>
</table>
<br>
<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse; text-align: center;">
  <thead>
    <tr> <th>Metric</th> <th>Score</th> </tr>
  </thead>
  <tbody>
    <tr> <td>Test Precision (weighted)</td> <td>0.1017</td> </tr>
    <tr> <td>Test F1 Score (weighted)</td> <td>0.1117</td> </tr>
    <tr> <td>LOOCV Accuracy</td> <td>0.1350</td> </tr>
  </tbody>
</table>
<br>
<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse; text-align: center;">
  <thead>
    <tr> <th>Class</th> <th>Precision</th> <th>Recall</th> <th>F1-Score</th> <th>Support</th> </tr>
  </thead>
  <tbody>
    <tr><td>0</td><td>0.00</td><td>0.00</td><td>0.00</td><td>4</td></tr>
    <tr><td>1</td><td>0.25</td><td>0.25</td><td>0.25</td><td>4</td></tr>
    <tr><td>2</td><td>0.20</td><td>0.25</td><td>0.22</td><td>4</td></tr>
    <tr><td>3</td><td>0.17</td><td>0.25</td><td>0.20</td><td>4</td></tr>
    <tr><td>4</td><td>0.00</td><td>0.00</td><td>0.00</td><td>4</td></tr>
    <tr><td>5</td><td>0.00</td><td>0.00</td><td>0.00</td><td>4</td></tr>
    <tr><td>6</td><td>0.00</td><td>0.00</td><td>0.00</td><td>4</td></tr>
    <tr><td>7</td><td>0.20</td><td>0.25</td><td>0.22</td><td>4</td></tr>
    <tr><td>8</td><td>0.20</td><td>0.25</td><td>0.22</td><td>4</td></tr>
    <tr><td>9</td><td>0.00</td><td>0.00</td><td>0.00</td><td>4</td></tr>
    <tr><td colspan="4">Accuracy</td><td>0.12</td></tr>
    <tr><td colspan="4">Macro Avg</td><td>0.11</td></tr>
    <tr><td colspan="4">Weighted Avg</td><td>0.11</td></tr>
  </tbody>
</table>
```

```{=html}
<div style="margin-left: 20px; margin-right: 20px; line-height: 1.6;">
  <p><b>KNN Evaluation</b><br>The learning curve reveals a significant gap between training and cross-validation performance for the KNN classifier:</p>
  <p>🔵 <strong>Training Score:</strong> The model achieves a perfect F1 score of 1.0 across all training set sizes, a strong indicator of overfitting—the model memorizes the training data rather than generalizing from it.</p>
  <p>🟢 <strong>Cross-Validation Score:</strong> Starts near 0.0 and only climbs to about 0.2 even with 160 training samples. This suggests the model struggles to generalize and perform well on unseen data.</p>
  <p>📉 <strong>Implication:</strong> Despite using the best hyperparameters, the model may be too sensitive to noise or lack sufficient complexity to capture meaningful patterns. KNN’s reliance on local structure might be failing due to sparse or high-dimensional data.</p>
</div>
```

## Metrics and Results (Random Forest)

```{=html}
<div style="display: flex; justify-content: flex-start; align-items: flex-start; margin-bottom: 20px; margin-left: -140px;">
  <div style="text-align: center; margin-right: 20px;">
    <img src="_extra\0_mL_scripts\0_p2\0_Random_Forest\RF_HP_Sweep_graph.png" alt="Hyperparameter Sweep" style="width: 130%; height: auto;">
    <p>Hyperparameter Sweep - RF</p>
  </div>
  <div style="text-align: center; margin-right: 50px;">
    <img src="_extra\0_mL_scripts\0_p2\0_Random_Forest\RF_LC.png" alt="Learning Curve 2" style="width: 130%; height: auto;">
    <p>Learning Curve - RF</p>
  </div>
</div>
```

```{=html}
<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse; text-align: left;">
  <thead>
    <tr> <th>Metric</th> <th>Value</th> </tr>
  </thead>
  <tbody>
    <tr> <td>max_depth</td> <td>10.0</td> </tr>
    <tr> <td>min_samples_leaf</td> <td>1.0</td> </tr>
    <tr> <td>min_samples_split</td> <td>2.0</td> </tr>
    <tr> <td>n_estimators</td> <td>200.0</td> </tr>
    <tr> <td>accuracy</td> <td>0.925</td> </tr>
    <tr> <td>precision</td> <td>0.927554</td> </tr>
    <tr> <td>recall</td> <td>0.925</td> </tr>
    <tr> <td>f1_score</td> <td>0.924728</td> </tr>
  </tbody>
</table>
```

```{=html}
<div style="margin-left: 20px; margin-right: 20px; line-height: 1.6;">
  <p><b>Random Forest Analysis</b><br>This random forest model appears to be overtrained:</p>
  <h4>🔍 Key Indicators of Overtraining</h4>
  <ul>
    <li><strong>Training Accuracy = 1.0</strong> across all training set sizes:
      <p>This suggests the model is memorizing the training data perfectly, which is a classic sign of overfitting.</p>
    </li>
    <li><strong>Validation Accuracy starts low (~0.1) and rises to ~0.85</strong>:
      <p>While the validation accuracy improves with more data, the persistent gap between training and validation accuracy indicates poor generalization early on.</p>
    </li>
    <li>Even at the largest training size, the model still performs significantly worse on unseen data than on training data.</li>
  </ul>
  <h4>📈 What a Healthy Learning Curve Might Look Like</h4>
  <ul>
    <li>Training accuracy should decrease slightly as training size increases (less memorization).</li>
    <li>Validation accuracy should increase and converge toward training accuracy.</li>
    <li>A smaller gap between the two curves suggests better generalization.</li>
  </ul>
  <h4>🧠 Why Random Forests Can Overfit</h4>
  <ul>
    <li>If the number of trees is too high or if each tree is allowed to grow too deep, the ensemble can overfit.</li>
    <li>Especially with small datasets, random forests can memorize patterns that don’t generalize.</li>
  </ul>
</div>
```

## Metrics and Results (CNN - Grey scale)

```{=html}
<div style="display: flex; justify-content: flex-start; align-items: flex-start; margin-bottom: 20px; margin-left: -140px;">
  <div style="text-align: center; margin-right: 20px;">
    <img src="_extra\0_mL_scripts\0_p2\0_grey_Mel_Train\HP_sweep.png" alt="Hyperparameter Sweep"
style="width: 130%; height: auto;">
    <p style="margin-left: 50px; margin-right: -40px;">Hyperparameter Sweep CNN - grey scale</p>
  </div>
  <div style="text-align: center; margin-right: 50px;">
    <img src="_extra\0_mL_scripts\0_p2\0_grey_Mel_Train\lc_grey_.png" alt="Learning Curve 2" style="width: 130%; height: auto;">
    <p style="margin-left: 50px; margin-right: -40px;">Learning Curve CNN - grey scale</p>
  </div>
</div>
```

```{=html}
<table border="1" cellpadding="5" cellspacing="0">
  <thead>
    <tr> <th>Conv Layers</th> <th>Epochs</th> <th>Patience</th> <th>Accuracy</th> <th>F1</th> <th>Precision</th> </tr>
  </thead>
  <tbody>
    <tr><td>2</td><td>10</td><td>2</td><td>0.0750</td><td>0.0143</td><td>0.0079</td></tr>
    <tr><td>2</td><td>10</td><td>5</td><td>0.1000</td><td>0.0229</td><td>0.0129</td></tr>
    <tr><td>3</td><td>10</td><td>2</td><td>0.1000</td><td>0.0182</td><td>0.0100</td></tr>
    <tr><td>3</td><td>10</td><td>5</td><td>0.1000</td><td>0.0182</td><td>0.0100</td></tr>
    <tr><td>2</td><td>15</td><td>2</td><td>0.1000</td><td>0.0186</td><td>0.0103</td></tr>
    <tr><td>2</td><td>15</td><td>5</td><td>0.1000</td><td>0.0186</td><td>0.0103</td></tr>
    <tr><td>3</td><td>15</td><td>2</td><td>0.1250</td><td>0.0450</td><td>0.0361</td></tr>
    <tr><td>3</td><td>15</td><td>5</td><td>0.1000</td><td>0.0182</td><td>0.0100</td></tr>
    <tr><td>2</td><td>30</td><td>2</td><td>0.1000</td><td>0.0182</td><td>0.0100</td></tr>
    <tr><td>2</td><td>30</td><td>5</td><td>0.2000</td><td>0.0750</td><td>0.0476</td></tr>
    <tr><td>3</td><td>30</td><td>2</td><td>0.1000</td><td>0.0182</td><td>0.0100</td></tr>
    <tr><td>3</td><td>30</td><td>5</td><td>0.1000</td><td>0.0182</td><td>0.0100</td></tr>
  </tbody>
</table>
```

```{=html}
<div style="border: 2px solid #ccc; padding: 15px; margin: 20px; background-color: #f9f9f9; border-radius: 10px;">
  <p><strong>CNN Model Assessment - Grayscale Data</strong></p>
  <p>🚨 <strong>Red Flags in the Learning Curve</strong></p>
  <ul>
    <li><strong>Training Accuracy rises to 1.0 by epoch 5:</strong> The model is perfectly memorizing the training data.</li>
    <li><strong>Validation Accuracy stays flat at ~0.2:</strong> The model is not generalizing at all. It is essentially guessing on unseen data.</li>
  </ul>
  <p>🔍 <strong>Possible Causes</strong></p>
  <ul>
    <li><strong>Data Issues:</strong>
      <ul>
        <li>Grayscale input might lack sufficient contrast or features.</li>
        <li>Labels could be noisy or mismatched.</li>
      </ul>
    </li>
    <li><strong>Model Complexity:</strong> The CNN might be too deep or have too many parameters for the dataset size.</li>
    <li><strong>Overtraining:</strong>
      <ul>
        <li>No regularization (e.g., dropout, weight decay).</li>
        <li>No early stopping.</li>
      </ul>
    </li>
  </ul>
  <p><strong>Note:</strong> All attempts to reduce overtraining did not work. It is postulated that the dataset needs to be larger for the CNN to learn meaningful patterns.</p>
</div>
```

## Metrics and Results (CNN - Color scale)

```{=html}
<div style="display: flex; justify-content: flex-start; align-items: flex-start; margin-bottom: 20px; margin-left: -140px;">
  <div style="text-align: center; margin-right: 20px;">
    <img src="_extra\0_mL_scripts\0_p2\0_col_Mel_Tran\HP_Sweep_Mel_col.png" alt="Hyperparameter Sweep" style="width: 130%; height: auto;">
    <p style="margin-left: 50px; margin-right: -40px;">Hyperparameter Sweep - Color Spectrogram - CNN</p>
  </div>
  <div style="text-align: center; margin-right: 50px;">
    <img src="_extra\0_mL_scripts\0_p2\0_col_Mel_Tran\learning_curve.png" alt="Learning Curve 2" style="width: 130%; height: auto;">
    <p style="margin-left: 50px; margin-right: -40px;">Learning Curve - Color Spectrogram - CNN</p>
  </div>
</div>
```

```{=html}
<div style="border: 2px solid #ccc; padding: 15px; margin: 20px; background-color: #f5f5f5; border-radius: 10px; font-family: Arial, sans-serif;">
  <p><strong>Final CNN Model Stats</strong></p>
  <ul>
    <li><strong>Training Accuracy:</strong> 0.1937</li>
    <li><strong>Training Loss:</strong> 2.1730</li>
    <li><strong>Validation Accuracy:</strong> 0.1750</li>
    <li><strong>Validation Loss:</strong> 2.1234</li>
    <li><strong>Epoch 5:</strong> Early stopping triggered</li>
    <li>Restoring model weights from the best epoch: 1</li>
  </ul>
  <p><strong>Final Best Model Metrics:</strong></p>
  <ul>
    <li>Accuracy: 0.1000</li>
    <li>Precision: 0.0100</li>
    <li>F1 Score: 0.0182</li>
  </ul>
</div>
```

```{=html}
<div style="border: 2px solid #ccc; padding: 15px; margin: 20px; background-color: #f9f9f9; border-radius: 10px; font-family: Arial, sans-serif;">
  <p><strong>Color Spectrogram CNN Learning Curve</strong></p>
  <p>This graph shows a modestly improving CNN model trained on color spectrogram data, but it is still underperforming overall.</p>
  <p><strong>📈 What the Learning Curve Shows:</strong></p>
  <ul>
    <li>Training Accuracy steadily increases from 0.0 to ~0.18 by epoch 4.</li>
    <li>Validation Accuracy peaks at epoch 2 (~0.20), then slightly declines and flattens.</li>
  </ul>
  <p><strong>🧠 Interpretation:</strong></p>
  <ul>
    <li>The model is learning, but very slowly.</li>
    <li>The validation peak at epoch 2 suggests the model briefly generalized well, but then started to overfit.</li>
    <li>The low overall accuracy (max ~0.20) implies the model is struggling to extract meaningful features from the spectrograms.</li>
  </ul>
  <p><strong>🔍 Possible Issues:</strong></p>
  <ul>
    <li>Spectrogram preprocessing might be suboptimal (e.g., poor resolution, noisy input).</li>
    <li>Model architecture may be too shallow or not well-tuned for this type of data.</li>
    <li>Class imbalance or label noise could be limiting performance.</li>
    <li>Too few epochs — the model might need more time to converge.</li>
  </ul>
  <p><em>Note:</em> Epoch 2 was the optimal epoch from the hyperparameter sweep.
A postulated fix is to use a larger dataset to improve generalization and model performance.</p>
</div>
```

## Results & Conclusion

The primary goal of this project was to develop a machine learning system capable of recognizing both the language spoken in audio files and the musical genre, in order to enhance personalization in AI-driven music recommendation platforms. By accurately identifying spoken language within songs and combining this information with genre metadata, the system aims to suggest tracks that more closely align with individual user preferences. The challenge involved processing raw audio data, separating vocal from instrumental components, extracting meaningful statistical and time-frequency features, and applying both classical machine learning models and deep learning architectures to capture the underlying patterns in music.

To address these goals, the team experimented with a variety of models. Classical approaches—including Logistic Regression, Random Forests, and Support Vector Machines—were trained on extracted audio features using cross-validation, yielding modest predictive performance with accuracy, precision, recall, and F1-scores generally between 10–60%. K-Nearest Neighbors and Random Forests were applied for genre classification, while Convolutional Neural Networks were trained on Mel spectrogram images (grayscale and color), using early stopping and hyperparameter sweeps to optimize performance. Although the models demonstrated only limited overall accuracy, CNNs showed comparatively stronger results, in line with the literature on deep learning's potential for audio analysis. Future improvements may include expanding the dataset size, refining spectrogram preprocessing, exploring deeper or more specialized architectures, and integrating more robust feature engineering to enhance both language and genre recognition for more effective recommendation systems.

## Video links

::: {style="background-color: #4CBB17; color: black; margin: 0px 0px 0px 0px;"}
<b>Nathan #\_to_Do</b>
:::

## Audio Player Demo

A demo of the UI/UX audio player written for this project. First, a few songs are scrolled through to demonstrate the 'real time' generation of dB vs. frequency curves for <code>.wav</code> and <code>.mp3</code> files. Next, a song is played to demonstrate the 'real time' audio analysis with the spectrogram (heat map) feature.

```{=html}
<video width="640" height="360" controls>
  <source src="_extra/2025-08-20 19-39-35.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
```

## sources

\[1\] https://link.springer.com/chapter/10.1007/978-981-97-4533-3_6

\[2\] https://arxiv.org/html/2411.14474v1